We provide yet another proof of the existence of calibrated forecasters; it has two merits. First, it is valid
for an arbitrary finite number of outcomes. Second, it is short and simple and it follows from a direct
application of Blackwell's approachability theorem to a carefully chosen vector-valued payoff function and convex
target set. Our proof captures the essence of existing proofs based on approachability (e.g., the proof by
Foster [5] in the case of binary outcomes) and highlights the intrinsic connection between approachability and calibration.
1. Motivation. Foster [5] stated that:

Over the past few years many proofs of the existence of calibration have been discovered. Each
of the following provides a different algorithm and proof of convergence: Foster and Vohra [6, 8],
Hart [12], Fudenberg and Levine [10], Hart and Mas-Colell [13]. Does the literature really need
one more? Probably not.
In spite of this, he argued, successfully, that his new proof of the existence of calibrated forecasters in the
case of binary outcomes, based on Blackwell's approachability theorem (Blackwell [1]), was shorter and
more direct than most of the previous proofs.

In this paper, we consider the general case of finitely many outcomes and exhibit an even shorter
(ten-line long) proof of the existence of calibrated forecasters based on approachability. We show therefore
that calibration is a straightforward consequence of approachability. As we realized by browsing on the
web, approachability and calibration are well-taught matters and we are confident that this new proof 
will become a standard example in the list of direct applications of approachability (as is already the 
case for the existence of no-regret forecasters). Since calibration is a central tool in learning in games
(see, e.g., Kakade and Foster [14]) and in online learning (see, e.g., Mannor, Tsitsiklis, and Yu [15]), the
simplicity of the proof and the guaranteed convergence rates open up new opportunities to use calibration 
in practical learning algorithms. 

Foster [5] mentions that his approachability-based proof of the existence of a calibrated forecaster 
was obtained by first considering a modification of an intuitive forecaster already stated in Foster and 
Vohra [5] and then working out the proof of its guarantees. We proceed the other way round and start 
directly from the statement of Blackwell's approachability theorem for convex sets [1, Theorem 3] but,
as a drawback, can only exhibit a forecaster which has to solve a linear program at each step. Taking
a closer look at Foster [5], one can see that we indeed capture the essence of his previous proof. His
algorithm is a clever modification, in the case of binary outcomes, of the general approachability-based 
forecaster presented below; the former has a nice, explicit, and simple statement. 



We now recall the informal definition and consequences of calibration. Consider a finite set of possible 
outcomes and suppose we obtain random forecasts about future events; these forecasts are each given by 
probability distributions over the outcomes. Now, such a sequence of forecasts is called calibrated
whenever it is consistent in hindsight, that is, when, for all distributions $p$, the actual empirical distribution
of the outcomes on those rounds when the forecast was close to $p$ is also close to $p$.
Having a calibrated forecasting scheme is beneficial in several ways. On the one hand, it allows some
agent to choose the best responses to the predicted forecasts or to consider other risk measures which
might be more valuable than greedily choosing the action leading to the highest reward. On the other
hand, calibrated forecasting rules enable multiple agents to converge to a reasonable joint play in some 
situations. For instance, if all players use calibrated forecasts of other players' actions, then the empirical 
distribution of action profiles converges to the set of correlated equilibria; see Foster and Vohra [7]. We
refer to Sandroni, Smorodinsky, and Vohra [19] for further discussion on calibrated forecasting as well as
its generalizations.

2. Setup and formal definition of calibration. We consider a finite set $\mathcal{A}$ of outcomes, with
cardinality denoted by $A$, and denote by $\mathcal{P} = \Delta(\mathcal{A})$ the set of probability distributions over $\mathcal{A}$. We equip
$\mathcal{P}$, which can be considered a subset of $\mathbb{R}^A$, with some$^1$ norm $\|\cdot\|$, to be referred to as the calibration
norm. In particular, the Dirac probability distribution on some outcome $a \in \mathcal{A}$ will be referred to as $\delta_a$.

A forecaster plays a game against Nature. At each step, it outputs a probability distribution $P_t \in \mathcal{P}$
while Nature chooses simultaneously an outcome $a_t \in \mathcal{A}$. We make no assumption on Nature's strategy.

The goal of the forecaster is to ensure the following property, known as calibration: for all strategies
of Nature,

$$\forall \varepsilon > 0, \ \ \forall p \in \mathcal{P}, \qquad \lim_{T \to \infty} \left\| \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{\{\|P_t - p\| \le \varepsilon\}} \bigl( P_t - \delta_{a_t} \bigr) \right\| = 0 \quad \text{a.s.} \tag{1}$$

The a.s. statement accounts for randomized forecasters. (It was shown by Oakes [16] and Dawid [4] that
randomization is essential for calibration.)



The literature (e.g., Foster and Vohra [8], Foster [5]) essentially considers a less ambitious goal, at
least in a first step: $\varepsilon$-calibration. (We explain in Section 4.2 how to get a calibrated forecaster from
some sequence of $\varepsilon$-calibrated forecasters with good properties.) Formally, given $\varepsilon > 0$, an $\varepsilon$-calibrated
forecaster considers some finite covering of $\mathcal{P}$ by $N_\varepsilon$ balls of radius $\varepsilon$ and abides by the following
constraints. Denoting by $p_1, \ldots, p_{N_\varepsilon}$ the centers of the balls in the covering (they form what will be referred
to later on as an $\varepsilon$-grid), the forecaster chooses only forecasts $P_t \in \{p_1, \ldots, p_{N_\varepsilon}\}$. We thus denote by
$K_t$ the index in $\{1, \ldots, N_\varepsilon\}$ such that $P_t = p_{K_t}$. The final condition to be satisfied is then that for all
strategies of Nature,

$$\limsup_{T \to \infty} \ \sum_{k=1}^{N_\varepsilon} \left\| \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{\{K_t = k\}} \bigl( p_k - \delta_{a_t} \bigr) \right\| \le \varepsilon \quad \text{a.s.} \tag{2}$$

When the calibration norm is the $\ell^1$-norm $\|\cdot\|_1$, the sum appearing in this criterion is usually referred
to as the $\ell^1$-calibration score (Foster [5]). Another popular criterion is the Brier score (Foster and
Vohra [8]), which we consider in Section 4.3; it is bounded, up to a factor of 2, by the $\ell^1$-calibration
score.
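As an illustration, the $\ell^1$-calibration score of a finite play can be computed directly from its definition, namely the sum over grid indices $k$ of the $\ell^1$-norm of $(1/T) \sum_t \mathbb{I}_{\{K_t = k\}} (p_k - \delta_{a_t})$. The following is only a minimal sketch: the function name `l1_calibration_score`, the 0-based indexing, and the list-based representation of the grid are assumptions made for the example and are not part of the paper.

```python
def l1_calibration_score(grid, picks, outcomes):
    """Sum over k of || (1/T) * sum_t 1{K_t = k} (p_k - delta_{a_t}) ||_1.

    grid     : list of N probability vectors (lists of length A), the grid p_1..p_N
    picks    : list of grid indices K_t (0-based, an assumption of this sketch)
    outcomes : list of outcome indices a_t (0-based)
    """
    T = len(picks)
    A = len(grid[0])
    # sums[k][a] accumulates sum_t 1{K_t = k} * (p_k[a] - 1{a_t = a})
    sums = [[0.0] * A for _ in grid]
    for k, a_t in zip(picks, outcomes):
        for a in range(A):
            sums[k][a] += grid[k][a] - (1.0 if a == a_t else 0.0)
    return sum(abs(v) for row in sums for v in row) / T


# Binary outcomes, three grid points; the forecaster always announces (1/2, 1/2).
grid = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
balanced = l1_calibration_score(grid, [1, 1, 1, 1], [0, 1, 0, 1])
one_sided = l1_calibration_score(grid, [1, 1, 1, 1], [0, 0, 0, 0])
```

In this toy run the forecast (1/2, 1/2) is consistent with the alternating outcomes, so `balanced` vanishes, whereas it is far from the empirical distribution when the outcome is always the first one.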



3. A geometric construction of $\varepsilon$-calibrated forecasters. In this section we prove our main
result regarding the existence of an $\varepsilon$-calibrated forecaster based on approachability theory. We recall
results from approachability theory, provide the main result (Theorem 3.2), and then address the issue of
computational complexity.



3.1 Statement of Blackwell's approachability theorem. Consider a vector-valued game between
two players, with respective finite action sets $\mathcal{I}$ and $\mathcal{J}$. We denote by $d$ the dimension of the
reward vector. The payoff function of the first player is given by a mapping $m : \mathcal{I} \times \mathcal{J} \to \mathbb{R}^d$, which is
linearly extended to $\Delta(\mathcal{I}) \times \Delta(\mathcal{J})$, the set of product-distributions over $\mathcal{I} \times \mathcal{J}$.

We denote by $I_1, I_2, \ldots$ and $J_1, J_2, \ldots$ the sequences of actions in $\mathcal{I}$ and $\mathcal{J}$ taken by each player (they
are possibly given by randomized strategies). Let $C \subseteq \mathbb{R}^d$ be some set. By definition, $C$ is approachable
if there exists a strategy for the first player such that for all strategies of the second player,

$$\lim_{T \to \infty} \ \inf_{c \in C} \ \left\| c - \frac{1}{T} \sum_{t=1}^{T} m(I_t, J_t) \right\|_2 = 0 \quad \text{a.s.}$$

That is, the first player has a strategy that ensures that the average of his vector-valued payoffs converges
to the set $C$.

1 The precise nature of this norm, e.g., $\ell^1$, Euclidean $\ell^2$, or $\ell^\infty$ supremum norm, is irrelevant at this stage, since all norms
are equivalent on finite-dimensional spaces.

For closed convex sets $C$, there is a simple characterization of approachability that is a direct
consequence of the minimax theorem.

Theorem 3.1 (Blackwell [1, Theorem 3]) A closed convex set $C \subset \mathbb{R}^d$ is approachable if and only
if

$$\forall q \in \Delta(\mathcal{J}), \ \ \exists p \in \Delta(\mathcal{I}), \qquad m(p, q) \in C.$$

3.2 Application to the existence of an $\varepsilon$-calibrated forecaster. As indicated above, we equip
$\mathcal{P}$ with some calibration norm $\|\cdot\|$ and fix $\varepsilon > 0$; we then consider an associated $\varepsilon$-grid $\{p_1, \ldots, p_{N_\varepsilon}\}$ in
$\mathcal{P} = \Delta(\mathcal{A})$.

Theorem 3.2 There exists an $\varepsilon$-calibrated forecaster which selects at every stage a distribution from this
grid.

Proof. We apply the results on approachability recalled above. To that end, we consider in our
setting the action sets $\mathcal{I} = \{1, \ldots, N_\varepsilon\}$ for the first player and $\mathcal{J} = \mathcal{A}$ for the second player.

We define the vector-valued payoff function as follows; it takes values in $\mathbb{R}^{A N_\varepsilon}$. For all $k \in \{1, \ldots, N_\varepsilon\}$
and $a \in \mathcal{A}$,

$$m(k, a) = \bigl( 0, \ldots, 0, \ p_k - \delta_a, \ 0, \ldots, 0 \bigr),$$

which is a vector of $N_\varepsilon$ elements of $\mathbb{R}^A$ composed of $N_\varepsilon - 1$ occurrences of the zero element $0 \in \mathbb{R}^A$ and
one non-zero element, located in the $k$-th position and given by the difference of probability distributions
$p_k - \delta_a$.

We now define the target set $C$ as the following subset of the $\varepsilon$-ball around $(0, \ldots, 0)$ for the calibration
norm $\|\cdot\|$. We write $(A N_\varepsilon)$-dimensional vectors of $\mathbb{R}^{A N_\varepsilon}$ as $N_\varepsilon$-dimensional vectors with components in
$\mathbb{R}^A$, i.e., for all $x \in \mathbb{R}^{A N_\varepsilon}$,

$$x = \bigl( x_1, \ldots, x_{N_\varepsilon} \bigr),$$

where $x_k \in \mathbb{R}^A$ for all $k \in \{1, \ldots, N_\varepsilon\}$. Then,

$$C = \left\{ x \in \mathbb{R}^{A N_\varepsilon} : \ \sum_{k=1}^{N_\varepsilon} \| x_k \| \le \varepsilon \right\}.$$

Note that $C$ is a closed convex set.

The condition (2) of $\varepsilon$-calibration can be rewritten as follows: the sequence of the vector-valued
rewards

$$\overline{m}_T \stackrel{\mathrm{def}}{=} \frac{1}{T} \sum_{t=1}^{T} m(K_t, a_t)$$

converges to the set $C$ almost surely.

The existence of an $\varepsilon$-calibrated forecaster is thus equivalent to the approachability of $C$, which we
now prove by showing that the characterization provided by Theorem 3.1 is satisfied. Let $q \in \Delta(\mathcal{J}) = \mathcal{P}$.
By construction, there exists $k \in \{1, \ldots, N_\varepsilon\}$ such that $\| p_k - q \| \le \varepsilon$ and thus

$$m(k, q) \in C.$$

(Here, the distribution $p$ of the approachability theorem can be taken as the Dirac distribution $\delta_k$.) □
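The key step of the proof, namely that every mixed action $q$ of Nature admits a grid index $k$ with $m(k, q) \in C$, can be checked numerically on a small case. The sketch below assumes binary outcomes, the $\ell^1$-norm as calibration norm, and a grid of mesh 0.1 in the first coordinate (which is then a 0.1-grid in $\ell^1$-norm); the helper names `payoff`, `in_target_set` and `blackwell_response` are hypothetical and only serve the illustration.

```python
def payoff(grid, k, q):
    """m(k, q): N blocks of size A, all zero except block k, which equals p_k - q."""
    A = len(q)
    blocks = [[0.0] * A for _ in grid]
    blocks[k] = [grid[k][a] - q[a] for a in range(A)]
    return blocks


def in_target_set(blocks, eps, slack=1e-9):
    """Membership in C = {x : sum_k ||x_k||_1 <= eps}; slack absorbs rounding errors."""
    return sum(abs(v) for block in blocks for v in block) <= eps + slack


def blackwell_response(grid, q, eps):
    """First grid index k with m(k, q) in C, or None if no grid point is eps-close to q."""
    for k in range(len(grid)):
        if in_target_set(payoff(grid, k, q), eps):
            return k
    return None


# Grid {(k/10, 1 - k/10)}: adjacent points are at l1-distance 0.2, so every
# distribution over two outcomes is within l1-distance 0.1 of some grid point.
grid = [[k / 10, 1 - k / 10] for k in range(11)]
k_star = blackwell_response(grid, [0.37, 0.63], 0.1)
all_covered = all(
    blackwell_response(grid, [j / 100, 1 - j / 100], 0.1) is not None
    for j in range(101)
)
```

The Dirac mass on the returned index plays the role of the distribution $p$ in the characterization of Theorem 3.1.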
3.3 Computation of the exhibited $\varepsilon$-calibrated forecaster. The proof of the approachability
theorem gives rise to an implicit strategy, as indicated in Blackwell [1]. We denote here by $\Pi_C$ the
projection in $\ell^2$-norm onto $C$.

At each round $t \ge 2$ and with the notations above, the forecaster should pick his action $K_t$ at random
according to a distribution $\psi_t = (\psi_{t,1}, \ldots, \psi_{t,N_\varepsilon})$ on $\{1, \ldots, N_\varepsilon\}$ such that

$$\forall a \in \mathcal{A}, \qquad \bigl( \overline{m}_{t-1} - \Pi_C(\overline{m}_{t-1}) \bigr) \cdot \bigl( m(\psi_t, a) - \Pi_C(\overline{m}_{t-1}) \bigr) \le 0, \tag{3}$$

where $\cdot$ denotes the inner product in $\mathbb{R}^{A N_\varepsilon}$. The proof of Theorem 3.1 (see Blackwell [1]) shows that
such a distribution $\psi_t$ indeed exists; the question is how to efficiently compute it. To do so, we first need
to compute the projection $\Pi_C(\overline{m}_{t-1})$ of $\overline{m}_{t-1}$.

We address the two computational issues separately. We first indicate how to find the projection
efficiently and then explain how to find the distribution $\psi_t$ based on the knowledge of this projection.

3.3.1 Projecting onto $C$. We need to find the closest point in $C$ to $\overline{m}_{t-1}$. Since $C$ is convex and
the $\ell^2$-norm is convex, we have to deal with a minimization problem of a convex function over a convex
set. Since answering the question whether a given point is in $C$ or not can be done in time linear in $A N_\varepsilon$,
the projection problem can be solved (approximately) in time polynomial in $A N_\varepsilon$.



Now, for the special case where the calibration norm is the $\ell^1$-norm $\|\cdot\|_1$, we can do much better.
For $i \in \{1, \ldots, A N_\varepsilon\}$, we denote by $s_{i,t-1} \in \{-1, 1\}$ the sign of the $i$-th component $\overline{m}_{i,t-1}$ of the vector
$\overline{m}_{t-1}$. (The value of the sign function at $x$ is arbitrary at $x = 0$, equal to $-1$ when $x < 0$ and to $1$ when
$x > 0$.) Then, $\Pi_C(\overline{m}_{t-1})$ is the solution of the following optimization problem, where the unknown is
$y = (y_1, \ldots, y_{A N_\varepsilon})$:

$$\min \ \bigl\| y - \overline{m}_{t-1} \bigr\|_2^2 \qquad \text{such that} \qquad \sum_{i=1}^{A N_\varepsilon} y_i \, s_{i,t-1} \le \varepsilon \quad \text{and} \quad y_i \, s_{i,t-1} \ge 0, \ \ \forall i \in \{1, \ldots, A N_\varepsilon\}.$$

It can be easily shown (as in Gafni and Bertsekas [11] or by an immediate adaptation of Palomar [17,
Lemma 1]) that the optimal solution is unique; it is given by $y(\mu^*)$ where for all $\mu \ge 0$,

$$y_i(\mu) = s_{i,t-1} \bigl( s_{i,t-1} \, \overline{m}_{i,t-1} - \mu \bigr)_+$$

and $\mu^*$ is chosen as the minimum nonnegative value such that $\sum_i y_i(\mu^*) \, s_{i,t-1} \le \varepsilon$. (Note that if $\mu^* > 0$
then $\sum_i y_i(\mu^*) \, s_{i,t-1} = \varepsilon$.) Finding $\mu^*$ can be done by a binary search to an arbitrary precision.

In conclusion, when the calibration norm is the $\ell^1$-norm $\|\cdot\|_1$, projecting onto $C$ can be done in linear
time in $A N_\varepsilon$ to a desired precision $\delta$ with complexity that depends on $\delta$ like $\log(1/\delta)$.
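The thresholding rule $y_i(\mu) = s_i (s_i m_i - \mu)_+$ combined with a binary search on $\mu$ can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `project_onto_l1_ball` is hypothetical, the vector is stored as a flat Python list, and the search runs for a fixed number of bisection steps (60 here, an arbitrary choice) instead of stopping at a prescribed precision.

```python
def project_onto_l1_ball(m, eps, iters=60):
    """Euclidean projection of the vector m onto C = {y : ||y||_1 <= eps}.

    Uses y_i(mu) = sign(m_i) * max(|m_i| - mu, 0) and a bisection on mu >= 0;
    each bisection step costs time linear in len(m).
    """
    if sum(abs(v) for v in m) <= eps:
        return list(m)                      # m already lies in C, i.e. mu* = 0
    lo, hi = 0.0, max(abs(v) for v in m)    # y(hi) = 0, so hi is always feasible
    for _ in range(iters):
        mu = (lo + hi) / 2
        if sum(max(abs(v) - mu, 0.0) for v in m) > eps:
            lo = mu                         # threshold too small, constraint violated
        else:
            hi = mu                         # feasible, try a smaller threshold
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - hi, 0.0) for v in m]


# Worked case: for m = (0.5, -0.3, 0.1) and eps = 0.4 the threshold is mu* = 0.2,
# since (0.5 - 0.2) + (0.3 - 0.2) + 0 = 0.4.
y = project_onto_l1_ball([0.5, -0.3, 0.1], 0.4)
```

Returning the feasible end `hi` of the search interval guarantees that the output satisfies the constraint, at the price of a projection that is accurate only up to the final interval width.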

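3.3.2 Finding the optimal distribution $\psi_t$ in (3). The question that has to be resolved is therefore
how to find $\psi_t$ that satisfies condition (3). Since we know that such a $\psi_t$ exists, it suffices, for instance,
to compute an element of

$$\mathop{\mathrm{argmin}}_{\psi} \ \max_{a \in \mathcal{A}} \ \bigl( \overline{m}_{t-1} - \Pi_C(\overline{m}_{t-1}) \bigr) \cdot m(\psi, a) \ = \ \mathop{\mathrm{argmin}}_{\psi} \ \max_{a \in \mathcal{A}} \ \sum_{k=1}^{N_\varepsilon} \psi_k \, \gamma_{k,a,t-1},$$

where we denoted $\gamma_{k,a,t-1} = \bigl( \overline{m}_{t-1} - \Pi_C(\overline{m}_{t-1}) \bigr) \cdot m(k, a)$.

This can be done efficiently by linear programming leading to a polynomial complexity in $N_\varepsilon$ and $A$.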



However, if instead of solving the minimax problem exactly we are satisfied with solving it
approximately, i.e., allowing a small violation $\delta > 0$ in each of the $A$ constraints given by (3), we can use
the multiplicative weights algorithm as explained in Freund and Schapire [9]; see also Cesa-Bianchi and
Lugosi [2, Section 7.2]. The complexity of such a solution would be of the order of

$$\frac{A N_\varepsilon \ln N_\varepsilon}{\delta^2}$$

since $(\ln N_\varepsilon)/\delta^2$ steps of complexity $A N_\varepsilon$ each have to be performed.
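A possible way to carry out this approximate minimax computation is sketched below: the first player runs an exponentially weighted (multiplicative weights) update over the $N_\varepsilon$ grid indices, the second player best-responds at each step, and the average of the mixed actions played is returned. The function name `approx_minimax`, the rescaling of the entries to $[0, 1]$, and the learning rate $\sqrt{8 \ln N / S}$ are assumptions of this sketch (a standard tuning, not one prescribed by the text); the entries $\gamma_{k,a}$ are assumed to lie in $[-1, 1]$.

```python
import math


def approx_minimax(gamma, steps):
    """Approximate argmin over psi of max_a sum_k psi[k] * gamma[k][a].

    gamma : N x A matrix (list of lists), entries assumed to lie in [-1, 1]
    steps : number S of multiplicative-weights rounds, each costing O(A * N)
    Returns the average of the mixed actions played by the first player.
    """
    N, A = len(gamma), len(gamma[0])
    eta = math.sqrt(8.0 * math.log(max(N, 2)) / steps)
    log_w = [0.0] * N                # log-weights, kept in log scale for stability
    avg = [0.0] * N
    for _ in range(steps):
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        psi = [v / total for v in w]
        # Best response of the second player against the current mixed action psi.
        a_star = max(range(A), key=lambda a: sum(psi[k] * gamma[k][a] for k in range(N)))
        for k in range(N):
            avg[k] += psi[k] / steps
            log_w[k] -= eta * (gamma[k][a_star] + 1.0) / 2.0   # loss rescaled to [0, 1]
    return avg


# Toy instance with minimax value 0, attained at psi = (1/2, 1/2).
gamma = [[1.0, -1.0], [-1.0, 1.0]]
psi_bar = approx_minimax(gamma, 2000)
worst = max(sum(psi_bar[k] * gamma[k][a] for k in range(2)) for a in range(2))
```

With this tuning, the standard regret bound for exponential weights gives a violation of at most $\sqrt{2 \ln N / S}$ after $S$ steps, which is in line with the $(\ln N_\varepsilon)/\delta^2$ step count stated above.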

The proof of Blackwell's approachability theorem shows that in this case the sequence of the average
payoff vectors $\overline{m}_t$ converges rather to the $\sqrt{\delta}$-expansion (in $\ell^2$-norm) of $C$; it is easy$^2$ to see that the
latter is included in the $\delta$-expansion (in $\ell^1$-norm) of $C$.

Putting all things together and taking the $\ell^1$-norm $\|\cdot\|_1$ as the calibration norm (in particular, to
define $C$), we can find a $2\varepsilon$-calibrated forecaster whose complexity is of the order of $A N_\varepsilon \, \varepsilon^{-2} \log N_\varepsilon$ at
each step. Since $N_\varepsilon$ behaves like $\varepsilon^{-(A-1)}$, the complexity per stage behaves
like $\varepsilon^{-(A+1)}$ (ignoring multiplicative and logarithmic factors). This implies a polynomial dependence in
$\varepsilon$ but an exponential dependence in $A$.
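The exponential dependence in $A$ comes from the size of the grid. As a rough illustration, consider the regular grid formed by the probability vectors whose coordinates are multiples of $1/m$; taking $m$ proportional to $1/\varepsilon$ is an assumption of this sketch (the exact constant linking $m$ and $\varepsilon$ depends on the norm and on $A$ and is not worked out here). The helper names `grid_size` and `simplex_grid` are hypothetical.

```python
from math import comb


def grid_size(A, m):
    """Number of probability vectors over A outcomes with coordinates in {0, 1/m, ..., 1}."""
    return comb(m + A - 1, A - 1)


def simplex_grid(A, m):
    """Enumerate the regular grid of mesh 1/m on the simplex over A outcomes."""
    def compositions(total, parts):
        # All ways of writing `total` as an ordered sum of `parts` nonnegative integers.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    return [[c / m for c in comp] for comp in compositions(m, A)]


sizes = {A: grid_size(A, 10) for A in (2, 3, 4)}
points = simplex_grid(3, 10)
```

For a fixed mesh the count grows like $m^{A-1}/(A-1)!$, which is the $\varepsilon^{-(A-1)}$ behaviour of $N_\varepsilon$ used in the complexity estimate.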

Remark 3.1 It is worth noting that when choosing a solution $\psi_t$, it is not possible to replace $\psi_t$ with
its mean or with an element of $p_1, p_2, \ldots, p_{N_\varepsilon}$ that is close to its mean. The reason is that this would
give rise to a deterministic rule, which, as we mentioned in Section 2, cannot be calibrated. The fact
that we have to randomize rather than take the mean is due to our construction of the vector-valued
game; therein, playing a mixed action $\psi_t$ over the $p_i$'s leads to a very different vector-valued reward than
playing the (element $p_k$ closest to the) mean of the mixed action. This is because different indices of the
$(A N_\varepsilon)$-dimensional space are involved.

4. Rates of convergence and construction of a calibrated forecaster. In this section we 
provide rates of convergence and discuss the construction of a calibrated (rather than e-calibrated) 
forecaster. We finally compare our results to some existing calibrated forecasters in the literature. 

The main result of this section is providing rates of convergence for a calibrated forecaster in (5). To
the best of our knowledge, this is the first rates result for calibration for an alphabet of size $A$ larger
than 2. For $A = 2$, (sub)optimal rates follow from the procedure of Foster and Vohra [8] as recalled in
Section 4.3.

4.1 Rates of convergence. Approachability theory provides uniform convergence rates of the sequence
of empirical payoff vectors to the target set, see Cesa-Bianchi and Lugosi [2, Exercise 7.23]. Formally,
denoting by $\|\cdot\|_2$ the Euclidean $\ell^2$-norm in $\mathbb{R}^{A N_\varepsilon}$, it follows in our context that there exists some absolute
constant $\gamma$ (independent of $A$ and $N_\varepsilon$) such that for all strategies of Nature and for all $T$, with probability
$1 - \delta$,

$$\bigl\| \overline{m}_T - \Pi_C(\overline{m}_T) \bigr\|_2 \le \gamma \sqrt{\frac{\ln(1/\delta)}{T}}.$$

Here, it is crucial to state the convergence rates based on the Euclidean norm because of an underlying
martingale convergence argument in Hilbert spaces proved by Chen and White [3]. The reason why the
convergence rate here is independent of $A$ and $N_\varepsilon$ is that the payoff vectors $m(k, a)$ all have a Euclidean
norm bounded by an absolute constant, e.g., 2; this happens because most of their components are 0.

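We now apply this result. However, we underline that the set $C$ can be defined by a different calibration
norm $\|\cdot\|$; below, we will define it based on the $\ell^1$-norm, for instance. But the stated uniform convergence
rate can be used since, via a triangle inequality and an application of the Cauchy-Schwarz inequality,

$$\| \overline{m}_T \|_1 \le \bigl\| \Pi_C(\overline{m}_T) \bigr\|_1 + \bigl\| \overline{m}_T - \Pi_C(\overline{m}_T) \bigr\|_1 \le \varepsilon + \sqrt{A N_\varepsilon} \, \bigl\| \overline{m}_T - \Pi_C(\overline{m}_T) \bigr\|_2.$$

$N_\varepsilon$ is of the order of $\varepsilon^{-(A-1)}$; we let $\gamma'$ be an absolute constant such that $N_\varepsilon \le \gamma' \varepsilon^{-(A-1)}$ for all $\varepsilon \le 1$
(say). We therefore have proved that given $0 < \varepsilon \le 1$, the forecaster defined in the previous section is
such that for all strategies of Nature and for all $T$, with probability $1 - \delta$,

$$\| \overline{m}_T \|_1 = \sum_{k=1}^{N_\varepsilon} \left\| \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{\{K_t = k\}} \bigl( p_k - \delta_{a_t} \bigr) \right\|_1 \le \varepsilon + \gamma \sqrt{\frac{A \gamma'}{\varepsilon^{A-1}}} \sqrt{\frac{\ln(1/\delta)}{T}}.$$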

2 It suffices to note that for all vectors $\Delta$ of a finite-dimensional space, one has $\|\Delta\|_\infty \le \|\Delta\|_2$, so that the inequality
$\|\Delta\|_2 \le \sqrt{\|\Delta\|_\infty \|\Delta\|_1}$ yields $\|\Delta\|_2 \le \|\Delta\|_1$.
This high-probability bound is to be used below as the key ingredient to construct a calibrated forecaster,
i.e., a forecaster satisfying (1). Combining the Borel-Cantelli Lemma with the bound above shows that
the less ambitious goal (2) can be achieved.



4.2 Construction of a calibrated forecaster. We use a standard approach which is commonly
known as the "doubling trick," see, e.g., Cesa-Bianchi and Lugosi [2]. It consists of defining a
meta-forecaster that proceeds in regimes; regime $r$ (where $r \ge 1$) lasts $T_r$ rounds and resorts for the forecasts
to an $\varepsilon_r$-calibrated forecaster, for some $\varepsilon_r > 0$ to be defined by the analysis. We now show that for
appropriate values of the $T_r$ and $\varepsilon_r$, the resulting meta-forecaster is calibrated in the sense of (1), and
even uniformly calibrated in the following sense, where $\mathcal{B}$ denotes the Borel sigma-algebra of $\mathcal{P}$:

$$\lim_{T \to +\infty} \ \sup_{B \in \mathcal{B}} \ \left\| \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{\{P_t \in B\}} \bigl( P_t - \delta_{a_t} \bigr) \right\| = 0 \quad \text{a.s.} \tag{4}$$

Of course, uniform calibration (4) implies calibration (1) via the choices for $B$ given by $\varepsilon$-balls around
probability distributions $p$.

For concreteness, we focus below on the $\ell^1$-calibration score.



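Regimes are indexed by $r = 1, 2, \ldots$ and the index of the regime corresponding to round $T$ is referred to
as $R_T$. The set of the rounds within regime $r \le R_T - 1$ is called $\mathcal{T}_r$; rounds in regime $R_T$ with index less
than $T$ are gathered in the set $\mathcal{T}_{R_T}$ (we commit here an abuse of notation). We denote by $p_k^r$, where
$k \in \{1, \ldots, N_{\varepsilon_r}\}$, the finite $\varepsilon_r$-grid considered in the $r$-th regime. By the triangle inequality satisfied by
$\|\cdot\|_1$, we first decompose the quantity of interest according to the regimes and to the played points of the
grids,

$$\sup_{B \in \mathcal{B}} \ \left\| \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{\{P_t \in B\}} \bigl( P_t - \delta_{a_t} \bigr) \right\|_1 \le \frac{1}{T} \sum_{r=1}^{R_T} \sum_{k=1}^{N_{\varepsilon_r}} \left\| \sum_{t \in \mathcal{T}_r} \mathbb{I}_{\{K_t = k\}} \bigl( p_k^r - \delta_{a_t} \bigr) \right\|_1.$$

We now substitute the uniform bound obtained in the previous section and get that with probability
$1 - (\delta_{1,T} + \ldots + \delta_{R_T,T}) \ge 1 - 1/T^2$,

$$\sup_{B \in \mathcal{B}} \ \left\| \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{\{P_t \in B\}} \bigl( P_t - \delta_{a_t} \bigr) \right\|_1 \le \frac{1}{T} \sum_{r=1}^{R_T} T_r \left( \varepsilon_r + \gamma \sqrt{\frac{A \gamma'}{\varepsilon_r^{A-1}}} \sqrt{\frac{\ln(1/\delta_{r,T})}{T_r}} \right),$$

where we defined $\delta_{r,T} = 1/(2^r T^2)$.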

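An application of the Borel-Cantelli Lemma and Cesaro's Lemma shows that for suitable choices of a
sequence $\varepsilon_r$ decreasing towards 0 and an increasing sequence $T_r$ such that $\varepsilon_r^{A-1} T_r$ tends to infinity fast
enough, one then gets the desired convergence (4). For instance, if $T_r = 2^r$, and $\varepsilon_r$ is chosen such that

$$\varepsilon_r \qquad \text{and} \qquad \sqrt{\frac{1}{\varepsilon_r^{A-1} \, T_r}}$$

are of the same order of magnitude, e.g., $\varepsilon_r = 2^{-r/(A+1)}$, then

$$\limsup_{T \to \infty} \ \frac{T^{1/(A+1)}}{\sqrt{\ln T}} \ \sup_{B \in \mathcal{B}} \ \left\| \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{\{P_t \in B\}} \bigl( P_t - \delta_{a_t} \bigr) \right\|_1 \le \Gamma_A \quad \text{a.s.} \tag{5}$$

where the constant $\Gamma_A$ depends only on $A$. As indicated above, to the best of our knowledge, this is the
first rates result for calibration for an alphabet of size $A$ larger than 2.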
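The regime schedule of the doubling trick with $T_r = 2^r$ and $\varepsilon_r = 2^{-r/(A+1)}$ can be sketched as follows. The helper names `regime_of_round` and `regime_accuracy` are hypothetical, and the sketch assumes that rounds are numbered from 1 and that regime $r$ starts right after regime $r - 1$ ends.

```python
def regime_of_round(T):
    """Index R_T >= 1 of the regime containing round T >= 1 when regime r lasts 2**r rounds."""
    r, end = 1, 2                # regime 1 covers rounds 1 and 2
    while T > end:
        r += 1
        end += 2 ** r            # regime r ends at round 2 + 4 + ... + 2**r
    return r


def regime_accuracy(r, A):
    """Accuracy eps_r = 2**(-r / (A + 1)) of the forecaster used during regime r."""
    return 2.0 ** (-r / (A + 1))


# Regime boundaries: rounds 1-2, 3-6, 7-14, 15-30, ...
regimes = [regime_of_round(T) for T in (1, 2, 3, 6, 7, 14, 15)]
eps_3 = regime_accuracy(3, 2)
```

Since $R_T$ grows like $\log_2 T$, the accuracy in force at round $T$ is of the order of $T^{-1/(A+1)}$, which matches the normalization appearing in (5).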

4.3 Comparison to previous forecasters. 



4.3.1 $\ell^1$-calibration score. Foster [5] first considered the $\ell^1$-calibration score in the context of the
prediction of binary outcomes only, i.e., when $A = 2$. The $\varepsilon$-calibrated forecaster he explicitly exhibited
has a computational complexity of the order of $1/\varepsilon$. He did not work out the convergence rates but since
his procedure is mostly a clever twist on our general procedure, they should be similar to the ones we
proved in Section 4.1.
4.3.2 Brier score. What follows is extracted from Foster and Vohra [8]; see also Cesa-Bianchi and
Lugosi [2, Section 4.5].

Given an $\varepsilon$-grid over the simplex $\mathcal{P}$, we define, for all $k \in \{1, \ldots, N_\varepsilon\}$, the empirical distribution of
the outcomes chosen by Nature at those rounds $t$ when the forecaster used $p_k$,

$$\overline{p}_T(k) = \begin{cases} p_k & \text{if } \sum_{t=1}^{T} \mathbb{I}_{\{K_t = k\}} = 0, \\[4pt] \displaystyle \sum_{t=1}^{T} \frac{\mathbb{I}_{\{K_t = k\}}}{\sum_{s=1}^{T} \mathbb{I}_{\{K_s = k\}}} \, \delta_{a_t} & \text{if } \sum_{t=1}^{T} \mathbb{I}_{\{K_t = k\}} > 0. \end{cases}$$

The classical Brier score can be shown in our setup to be equal to the following criterion:

$$\sum_{k=1}^{N_\varepsilon} \bigl\| \overline{p}_T(k) - p_k \bigr\|_2^2 \left( \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{\{K_t = k\}} \right).$$

Since for two probability distributions $p$ and $q$ of $\mathcal{P}$, one always has

$$\| p - q \|_2^2 \le 2 \, \| p - q \|_1,$$

the Brier score can be seen to be upper bounded by twice the $\ell^1$-calibration score; it is thus a weaker
criterion.
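The comparison between the two criteria can be checked on a finite play by computing both scores from the empirical distributions. In the sketch below, the function name `brier_and_l1_scores` and the 0-based indexing are assumptions made for the illustration; grid points that were never used contribute nothing, in line with the convention that their empirical distribution equals $p_k$.

```python
def brier_and_l1_scores(grid, picks, outcomes):
    """Return (Brier score, l1-calibration score) of a finite play.

    grid     : list of N probability vectors (lists of length A)
    picks    : list of grid indices K_t (0-based)
    outcomes : list of outcome indices a_t (0-based)
    """
    T, A = len(picks), len(grid[0])
    counts = [0] * len(grid)
    hits = [[0] * A for _ in grid]
    for k, a in zip(picks, outcomes):
        counts[k] += 1
        hits[k][a] += 1
    brier = l1 = 0.0
    for k, p in enumerate(grid):
        if counts[k] == 0:
            continue                      # unused grid point: zero contribution
        emp = [h / counts[k] for h in hits[k]]   # empirical distribution on rounds with K_t = k
        weight = counts[k] / T
        brier += weight * sum((e - x) ** 2 for e, x in zip(emp, p))
        l1 += weight * sum(abs(e - x) for e, x in zip(emp, p))
    return brier, l1


grid = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
brier, l1 = brier_and_l1_scores(grid, [1, 1, 1, 1], [0, 0, 0, 0])
```

In this toy play the forecast (1/2, 1/2) faces the constant outcome 0, and the Brier score stays below twice the $\ell^1$-calibration score, as the inequality above predicts.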



Cesa-Bianchi and Lugosi [2, Section 4.5] show however that forecasters with Brier scores
asymptotically smaller than $\varepsilon$ can be the keystones to construct calibrated forecasters, in a way similar to the
construction exhibited in Section 4.2.

In the case $A = 2$, these forecasters essentially bound the Brier score, with probability at least $1 - \delta$,
by a term that is of the order of

£ + l\ T ' 

which is worse than the rate we could exhibit in Section 4.1 for the $\ell^1$-calibration score.

In addition, the computational complexity of the underlying procedure (based on the minimization of
internal regret) is of the order of $1/\varepsilon^2$ per stage and thus is similar to the complexity $1/\varepsilon^{A+1} = 1/\varepsilon^3$ we
derived in Section 3.3 for our new procedure.



The general case of $A \ge 3$ is briefly mentioned in Cesa-Bianchi and Lugosi [2, Section 4.5] indicating
that the case of $A = 2$ can be extended to $A \ge 3$ without further details. As far as we can say, the
computational complexity of such an extension per step would be of the order of $1/\varepsilon^{2(A-1)}$ versus $1/\varepsilon^{A+1}$
for the approachability-based procedure we suggested above. The convergence rates, for a straightforward
extension, seem to be quite slow. However, based on a draft of the present article, Perchet [18] recently
proposed a more efficient extension of the procedure of Foster and Vohra [8] and obtained the same rates
of convergence as in (5); he however did not work out the complexity of his procedure, which seems to
be similar to the one of our construction.